An Automatic Close Copy Speech Synthesis Tool for Large-Scale Speech Corpus Evaluation
Authors
Abstract
The production of rich multilingual speech corpus resources on a large scale is a requirement for many linguistic, phonetic and technological tasks, in both research and application domains. It is also time-consuming and therefore expensive. In particular the human component in the resource creation process is prone to inconsistencies, a situation which has frequently been documented in studies of cross-transcriber consistency in manual time-aligned signal annotation. In the present case, corpora of three languages were to be evaluated and corrected: (1) Polish, a large automatically annotated and manually corrected single-speaker TTS unit-selection corpus in the BOSS Label File (BLF) format, (2) German and (3) English, the second and third being manually annotated multi-speaker story-telling learner corpora in Praat TextGrid format. A method is provided for supporting the evaluation and correction of time-aligned annotations for the three corpora by permitting a rapid audio screening of the annotations by an expert listener for the detection of perceptually conspicuous systematic or isolated errors in the annotations. The criterion for perceptual conspicuousness was provided by converting the annotation formats into the interface format required by the MBROLA speech synthesiser. The audio screening procedure is complementary to other methods of corpus evaluation and does not replace them. Conceptually the ACCS synthesis tool is intended as an extension of the BLARK toolkit for speech corpora.

1. Efficient quality control of richly annotated corpora

The production of rich multilingual speech corpus resources on a large scale is a requirement for many linguistic, phonetic and technological tasks, in both research and application domains. It is also time-consuming and therefore expensive.
In particular the human component in the resource creation process is prone to inconsistencies, a situation which has frequently been documented in studies of cross-transcriber consistency in manual time-aligned signal annotation, particularly with prosody (Grice 2006; Gibbon et al. 1997; Gibbon et al. 2000). In the present case, corpora of three languages were to be evaluated and corrected: (1) Polish, a large automatically annotated and manually corrected single-speaker TTS unit-selection corpus in the BOSS Label File (BLF) format (Demenko et al. 2006), (2) German and (3) English, the second and third being manually annotated multi-speaker story-telling learner corpora in Praat TextGrid format (Boersma 2001; Gut et al. 2004).

The first general goal is to provide a method for supporting the evaluation and correction of time-aligned annotations for the three corpora by permitting rapid audio screening of the annotations by an expert listener, who detects perceptually conspicuous systematic or isolated errors in the annotations. The criterion for perceptual conspicuousness is provided by converting the annotation formats into the interface format required by a suitable speech synthesiser, in this case the PHO format required by the MBROLA synthesiser (Dutoit et al. 1996). The audio screening procedure is complementary to other methods of corpus evaluation and does not replace them. Functionally, the ACCS synthesis tool is intended as an addition to the BLARK (Krauwer 2005) toolkit for speech corpora.

A second general goal is a practical one: the method is also intended for use with corpora for less-resourced languages and for use in areas with very basic infrastructures. The method should therefore not only be of good quality and well-defined, but at the same time straightforward and as far as possible not dependent on complex cutting-edge tools, expensive software packages, or the internet. Otherwise, usability by development teams working with sub-optimal infrastructures in an under-resourced language setting is not assured.

The focus in the present application is on segmental annotation evaluation, but the pitch pattern of the original speech signals was also extracted and mapped into the synthesiser interface in order to provide as natural a re-synthesis as possible for the segmental evaluation. Prosodic annotation is not evaluated. The issues addressed are:
1. Consistency of the labels used with the defined label set (e.g. phoneme or phone set).
2. Correct time-stamp assignment (e.g. segment duration).
3. Correct label selection from the relevant inventory.

The first of these problems is a 'syntactic' issue which can be dealt with automatically, given a specified inventory of labels. The second and third are 'semantic' issues which require an element of subjective assessment, in that mapping of the annotation to the speech signals is involved. This assessment can be cross-transcriber
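
The automatic part of the procedure described above can be illustrated with a minimal Python sketch. It is not the ACCS tool's implementation. It assumes that an annotation tier has already been read from the BLF or TextGrid file into a list of (label, start, end) tuples with times in seconds, and that the labels already match the phoneme symbols of the chosen MBROLA voice. The function names `unknown_labels` and `to_pho_lines` are hypothetical, and the choice of a single pitch target at 50% of each segment is a simplification made here. The only format knowledge used is that a PHO line consists of a phoneme symbol, a duration in milliseconds, and optional pairs of position (percent of the segment) and pitch (Hz).

```python
from typing import Iterable, List, Optional, Sequence, Set, Tuple

# One annotated segment: (label, start time in s, end time in s).
Interval = Tuple[str, float, float]


def unknown_labels(intervals: Iterable[Interval],
                   inventory: Set[str]) -> List[Tuple[int, str]]:
    """Issue 1: return (index, label) for each label outside the inventory."""
    return [(i, label) for i, (label, _, _) in enumerate(intervals)
            if label not in inventory]


def to_pho_lines(intervals: Iterable[Interval],
                 f0_hz: Optional[Sequence[Optional[float]]] = None) -> List[str]:
    """Convert segments to PHO lines: 'symbol duration_ms [position% pitch_hz]'.

    f0_hz, if given, holds one pitch value (or None) per segment; the value
    is written as a single pitch target at 50% of the segment (assumption).
    """
    segments = list(intervals)
    if f0_hz is None:
        f0_hz = [None] * len(segments)
    if len(f0_hz) != len(segments):
        raise ValueError("f0_hz must have one entry per interval")

    lines = []
    for (label, start, end), hz in zip(segments, f0_hz):
        dur_ms = round((end - start) * 1000)
        if dur_ms <= 0:
            # A zero or negative duration points to a time-stamp error.
            raise ValueError(f"non-positive duration for {label!r} at {start} s")
        line = f"{label} {dur_ms}"
        if hz is not None:
            line += f" 50 {round(hz)}"
        lines.append(line)
    return lines


if __name__ == "__main__":
    # Hypothetical example tier; "_" is used here as the silence symbol.
    tier = [("_", 0.0, 0.1), ("h", 0.1, 0.18), ("a", 0.18, 0.30)]
    print(unknown_labels(tier, {"_", "h", "a"}))
    print("\n".join(to_pho_lines(tier, [None, 120.0, 110.0])))
```

The label check covers only the first, 'syntactic' issue. The PHO output is what makes the second and third issues audible: a wrong duration or a wrong label selection is carried unchanged into the synthesiser input, so that the expert listener can detect it in the re-synthesis.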